\documentclass[a4paper]{article}

\usepackage{listings}
\usepackage{color}

\usepackage[top=2.0cm, bottom=2.0cm, left=2.0cm, right=2.0cm]{geometry}

\lstset{numbersep=5pt,numbers=left,numberstyle=\footnotesize,title=\lstname,basicstyle=\footnotesize,showspaces=false,breaklines=true}

\begin{document}

\title{Transactional storage for geo-replicated systems, overview}

\author{Jander Nascimento
\and Liviu Varga}

\maketitle

%Begin Liviu section

\section{Introduction}
This paper will give a short review of Walter, a key-value store that allows asynchronous replication. It supports transactions and replicates data across remote sites. This goals are reached by using preferred sites and counting sets, together with a protocol named Parallel Snapshot Isolation (PSI). It was coined after Snapshot Isolation(SI), an approach already largely used by commercial databases (e.g. Oracle). 

SI provides a strong guarantee within a site; Which leaves the lack for the multiple sites store. Thus, PSI provides casual ordering and precludes write-write conflicts across sites.

The need of such a system is because web applications are getting more popular and are deployed in a distributed  manner around the world (providing a better locality). Thus fault tolerance and throughput became a major concerning. 

But performance is not the only key for ensure the success of such application, the consistence and protection against conflicts are fundamental, that where transaction plays a important role in distributed system. 

The transaction of concept is to group writes in an atomic unit so that failures do not leave behind partial writes and concurrent access by other services are not intermingled(isolated). Without them there is a significant burden which have to be dealt by application developers. 

The key features of Walter are:

\begin{itemize}
\item \textit{Asynchronous replication across sites.} Replication is done lazily in the background to reduce latency
\item \textit{Efficient update-anywhere}. Counting Sets can be updated efficiently anywhere
\item \textit{Conflict-resolution logic}. Solving part of conflict issues
\item \textit{Strong isolation within each site}. Provided by PSI.
\end{itemize}
          
\section{Parallel Snapshot Isolation}
Snapshot isolation ensures that: \textbf{First}, transactions are read from a snapshot that reflects the last commit transaction, and, \textbf{secondly}, providing certain level of non-assisted conflict resolution, e.g. if two concurrent transactions have a write-write conflict, one is aborted.
 
\textit{Parallel Snapshot Isolation} extends Snapshot Isolation by allowing different sites to have different commit orderings across multiple sites. PSI's relaxation is acceptable for web applications where each user communicates with one site at a time and there is no need for a global ordering of all actions across all users. 

To avoid write-write conflicts across sites and for implementation of PSI, Walter uses two techniques:
\begin{itemize}
\item \textit{Preferred sites.} Each container is assigned to a preferred site which is the site that is allowed to write objects in that container. Meaning that the objects inside the container can be committed without checking other sites for write conflicts.
\item \textit{Conflict-free counting set objects.} A \textit{cset} is similar to a \textit{multiset}, in the sense that it keeps a count for each element, but unlike \textit{multisets}, the count could be negative for \textit{cset}. A \textit{cset} supports few operations, which can be extended. Operation $add(x)$, which add element $x$ and increments the counter for the object $x$; and an operation $rem(x)$ which decrements the counter of $x$. Because increment and decrement commute, $add$ and $rem$ also commute, and so operations never conflict. 
\end{itemize}

PSI properties:

\textsc{Property} 1. \textit{(Site Snapshot Read) All operations read the most recent committed version at the current site.}

\textsc{Property} 2. \textit{(No Write-Write Conflicts) Transaction must be disjoint}

\textsc{Property} 3. \textit{(Commit Causality Order Across Sites) The transaction must have the same order in every site. e.g. If in site \emph{A} $T_1<T_2$ then in site \emph{B} the order $T_2>T_1$ still holds.}

%may be add some other info
It is said that the transactions are \textit{somewhere-concurrent} if they are concurrent between sites, by having overlapping transactions.

The \textbf{short fork} and \textbf{long fork} anomalies may appear because of the asynchronism nature of PSI replication, which are the price paid to replicate the storage log asynchronously. Others anomalies such as dirty read, non-repeatable read, lost update or conflicting fork does not happen in PSI. 
%end Liviu section

%start Jander section

\section{Design and Algorithm}

Each site has a Walter server and several clients. For a given object is assigned a single Preferred Site. It is responsible for update (perform write) that object. Although others clients can read from it.

For a matter of organization, the objects are kept in what is called \emph{Containers}, one site may contain several containers. For instance, the administrator can create one Container per user, so he can move the container across other sites to improve locality. 

Although, it is not possible to move the object from a given container to another, since the \textit{oid} (Object ID) contains the container id as well. Thus, perform such operation would require updating the \textit{oid}'s. 

To provide a site fault tolerance, Walter relies on a \textit{Configuration Service}, which is responsible for leasing the role of Preferred Site to the containers. 

In order to perform such operation the \emph{Configuration Service}(CS) keeps a list of sites which are up and running. Example, CS detects a failure in the site \emph{A}, CS first updates the available sites list by removing site \emph{A} and assign another site to play the role of Preferred site for the site \emph{A}. 

For keeping track the sequence of operations, Walter uses vector timestamps to synchronize between sites, since the monotonic time concept is expensive and causes unnecessary dependencies(not causal ordering). Each site has a local version number indicating the number of transactions reflected in the snapshot.

There exist two types of transactions: the fast and slow commit. 

The \textit{fast commit} is used for objects which are local to the preferred site and involves two checks: first, checking if the objects are unmodified since the transaction started and second if all the objects are unlocked. 
In other corner, \textit{slow commit} is used for transactions whose object's preferred site is a non-local one. In this case the update is done in two phases: first, the server asks the objects preferred sites if they are unmodified and unlocked and if all of the sites respond with a "yes" then it proceeds to commit as a fast commit protocol. %review once more

Propagation of committed transactions are done asynchronously to the other sites. When all the transactions are committed at all sites, the transaction is marked as globally thus becomes visible to other sites. 

\section{Evaluation}

They first evaluated the base performance of Walter comparing it with Berkely DB 11gR2. The resulted transaction throughputs of read and write were comparable.

The benchmark consisted in either write or read transactions each accessing one 100 byte object. Experiments were run on Amazon's EC2 cloud platform in four sites across the globe. 

The micro-benchmark for evaluating the throughput of transactions across several sites with read/write operations as well as performing several operations within one transactions showed a linear scalability with the number of sites. Read throughput is bounded by the RPC performance and write throughput is lower compared to the read one because of the lock contention. Having several operations within one transactions increases the throughput compared with every operation being a single transaction. ReTwis (Twitter like application) running with Walter achieved a slowdown of about 25\% compared to the version running with Redis as a database but showed a doubling of throughput when scaled on two sites. 

As far as the latency concerns, when having fast commits this was mainly influenced by the effects of queuing inside the Walter server and of flushing the commit log to disk resulting a 99.9 percentile latency of 27 ms. When it comes to disaster safe transactions the latency is approximately uniform distributed between $[RTT_{MAX},2*RTT_{MAX}]$ where $RTT_{MAX}$ is the maximum round trip latency between the site which issued the transactions and other sites.
In case of WaltSocial (Facebook like application) the latency for different operations (like read-info, befriend, status update, post message) was below 50ms for the 99.9 percentile.

%end Jander section

\section{Conclusion}

The transactional geo-replicated key-value store with Parallel Snapshot Isolation support is able to replicated the change among other sites without compromising the speed and reducing the number of possible conflicts. The performance was verified using implementation of two very common services and the results presented a good scalability, although the number of tested server is still small (only 4 sites).  

\end{document}
